[SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize.#13890
[SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize.#13890ueshin wants to merge 4 commits intoapache:masterfrom
Conversation
…liminate serialize/deserialize.
|
Test build #61169 has finished for PR 13890 at commit
|
- because the Encoder for Java Beans needs not only getters but also setters.
|
Test build #61176 has finished for PR 13890 at commit
|
|
@marmbrus Could you review or assign someone to review this pr please? |
|
/cc @cloud-fan |
| } | ||
|
|
||
| /** Physical plan node for scanning data from an RDD. */ | ||
| private[sql] case class ExistingRDDScanExec[T]( |
There was a problem hiding this comment.
From the name it's hard to tell what's the difference between this one and RDDScanExec...
There was a problem hiding this comment.
How about renaming RDDScanExec to LogicalRDDScanExec ?
There was a problem hiding this comment.
LogicalRDDScanExec sounds weird because Logical means that it is a logical node but Exec means that it is physical node.
There was a problem hiding this comment.
how about ExternalRDDScan?
|
it's a pretty good optimization! should we also apply it to |
|
Hmm, I think we can't apply it to |
|
Test build #61950 has finished for PR 13890 at commit
|
|
|
||
| private[sql] object ExternalRDD { | ||
|
|
||
| def apply[T: Encoder](rdd: RDD[T])(session: SparkSession): LogicalPlan = { |
There was a problem hiding this comment.
Because I wanted to make the signature similar to the case class constructor.
Should I uncurry?
There was a problem hiding this comment.
Actually I'm not sure why we curry the constructor either. Since RDDScan does it, it's ok we follow it. But for this apply method, I don't see the value of doing it. cc @yhuai
There was a problem hiding this comment.
There is probably no reason to use multiple parameter lists here. We sometimes use it for case classes so that arguments that should not effect equality are not included in the generated equals method.
|
LGTM, cc @liancheng to take another look |
|
Test build #62008 has finished for PR 13890 at commit
|
|
retest this please |
|
will merge it once tests pass, thanks for working on it! |
|
Test build #62150 has finished for PR 13890 at commit
|
|
merging to master! |
|
@cloud-fan Thank you for merging this! |
What changes were proposed in this pull request?
Currently the input
RDDofDatasetis always serialized toRDD[InternalRow]prior to being asDataset, but there is a case that we usemapormapPartitionsjust after converted toDataset.In this case, serialize and then deserialize happens but it would not be needed.
This pr adds
ExistingRDDlogical plan for input withRDDto have a chance to eliminate serialize/deserialize.How was this patch tested?
Existing tests.